Joey Hess [Thu, 1 Jun 2023 17:46:16 +0000 (13:46 -0400)]
use importChanges optimisation
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
Benchmarks:
Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.
An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.
Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.
Sponsored-by: k0ld on Patreon
Joey Hess [Wed, 31 May 2023 20:34:03 +0000 (16:34 -0400)]
Merge branch 'master' of ssh://git-annex.branchable.com
Joey Hess [Wed, 31 May 2023 19:45:23 +0000 (15:45 -0400)]
implement importChanges optimisation (not used yet)
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et al. to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
Joey Hess [Wed, 31 May 2023 16:31:14 +0000 (12:31 -0400)]
build git trees using ContentIdentifier to speed up import
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.
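The planned diff between the old and new trees can be sketched roughly as
follows. This is a minimal Python illustration, not the Haskell code in
git-annex. The function name changed_paths and the representation of a tree
as a mapping from path to content identifier are assumptions made for the
example.

```python
def changed_paths(old_tree, new_tree):
    """Compare two {path: content_identifier} mappings.

    Returns (added_or_modified, removed), so that only those paths
    need further processing during an import.
    """
    # A path needs work if it is new or its content identifier differs.
    added_or_modified = sorted(
        path for path, cid in new_tree.items()
        if old_tree.get(path) != cid
    )
    # A path was removed if it exists only in the old tree.
    removed = sorted(path for path in old_tree if path not in new_tree)
    return added_or_modified, removed
```

With 10000 unchanged files and 1 new file, such a comparison would yield a
single path to process instead of 10001.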
Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.
Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.
A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.
Sponsored-by: Luke Shumaker on Patreon
Joey Hess [Tue, 30 May 2023 21:19:23 +0000 (17:19 -0400)]
update
Joey Hess [Tue, 30 May 2023 21:05:28 +0000 (17:05 -0400)]
avoid import writing to cidsdb initially
Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.
Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.
Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.
Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.
But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.
I don't think there's a good way to prevent or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.
Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.
Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.
Sponsored-by: Noam Kremen on Patreon
jgoerzen [Tue, 30 May 2023 20:58:21 +0000 (20:58 +0000)]
Added a comment
Joey Hess [Tue, 30 May 2023 20:11:29 +0000 (16:11 -0400)]
improve test descriptions
Joey Hess [Tue, 30 May 2023 20:09:13 +0000 (16:09 -0400)]
repair: Fix handling of git ref names on Windows
Sponsored-by: Kevin Mueller on Patreon
Joey Hess [Tue, 30 May 2023 19:49:52 +0000 (15:49 -0400)]
update
Joey Hess [Tue, 30 May 2023 19:42:34 +0000 (15:42 -0400)]
comment and a neat idea
Joey Hess [Tue, 30 May 2023 19:42:11 +0000 (15:42 -0400)]
tab indentation
Joey Hess [Tue, 30 May 2023 18:30:39 +0000 (14:30 -0400)]
comment
jgoerzen [Tue, 30 May 2023 12:23:28 +0000 (12:23 +0000)]
jgoerzen [Tue, 30 May 2023 00:37:10 +0000 (00:37 +0000)]
jgoerzen [Tue, 30 May 2023 00:35:54 +0000 (00:35 +0000)]
Mowgli [Mon, 29 May 2023 22:42:13 +0000 (22:42 +0000)]
Added a comment: Use locales for that purpose
Daniel Höxtermann [Sun, 28 May 2023 05:12:15 +0000 (07:12 +0200)]
Add borg2annex to related_software
Joey Hess [Sat, 27 May 2023 17:09:48 +0000 (13:09 -0400)]
Merge branch 'master' of ssh://git-annex.branchable.com
Joey Hess [Sat, 27 May 2023 16:45:16 +0000 (12:45 -0400)]
default to yt-dlp and fix progress parsing bugs
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.
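The remainder handling problem described above can be illustrated with a
small sketch. This is Python, not the Haskell parser in git-annex, and the
"done/total" line format, PROGRESS_RE, and parse_progress are hypothetical
names invented for the example. The point is that only an incomplete
trailing line is carried over, while complete lines that fail to parse are
dropped rather than fed back in.

```python
import re

# Hypothetical progress line format, assumed for this sketch only.
PROGRESS_RE = re.compile(r"^(\d+)/(\d+)$")

def parse_progress(buffer, chunk):
    """Feed a chunk of output; return (parsed, remainder).

    Only an incomplete trailing line is kept as the remainder.
    Complete lines that fail to parse are discarded, so they are
    never re-parsed on later calls.
    """
    data = buffer + chunk
    # Everything before the last newline is a complete line.
    *lines, remainder = data.split("\n")
    parsed = []
    for line in lines:
        match = PROGRESS_RE.match(line.strip())
        if match:
            parsed.append((int(match.group(1)), int(match.group(2))))
    return parsed, remainder
```

Because the remainder never grows beyond one partial line, the cost of each
call stays proportional to the size of the new chunk.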
Using --progress-template like this should avoid parsing problems as well
as future proof against output changes. But it will only work with yt-dlp.
So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.
git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.
Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.
Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?
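For reference, this is how printf-style formatting behaves in plain Python.
The sketch assumes, without having verified it here, that yt-dlp passes
template fields through the same mechanism. The value is made up for the
example. Note that in plain Python the "i" conversion drops the fractional
part rather than rounding to the nearest integer.

```python
# printf-style formatting: "i" converts to an integer, "s" to a string.
estimated_total = 1048575.75  # made-up example value

as_int = "%i" % estimated_total   # fractional part is dropped
as_str = "%s" % estimated_total   # float is rendered as-is
```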
Sponsored-by: Joshua Antonishen on Patreon
Joey Hess [Wed, 24 May 2023 18:04:09 +0000 (14:04 -0400)]
assist: honor gitignore
Sponsored-by: Graham Spencer on Patreon
nobodyinperson [Wed, 24 May 2023 14:59:40 +0000 (14:59 +0000)]
yarikoptic [Tue, 23 May 2023 16:12:42 +0000 (16:12 +0000)]
Added a comment
Joey Hess [Tue, 23 May 2023 16:00:01 +0000 (12:00 -0400)]
comment
Joey Hess [Tue, 23 May 2023 15:46:54 +0000 (11:46 -0400)]
document -m
Joey Hess [Tue, 23 May 2023 15:45:17 +0000 (11:45 -0400)]
comment
Mowgli [Tue, 23 May 2023 13:10:45 +0000 (13:10 +0000)]
nobodyinperson [Mon, 22 May 2023 11:22:42 +0000 (11:22 +0000)]
nobodyinperson [Sat, 20 May 2023 06:03:30 +0000 (06:03 +0000)]
Added a comment
yarikoptic [Fri, 19 May 2023 19:22:04 +0000 (19:22 +0000)]
Added a comment
yarikoptic [Fri, 19 May 2023 19:06:01 +0000 (19:06 +0000)]
Added a comment
Joey Hess [Fri, 19 May 2023 19:00:57 +0000 (15:00 -0400)]
version: Avoid error message when entire output is not read
Sponsored-by: Dartmouth College's Datalad project
Joey Hess [Fri, 19 May 2023 18:54:09 +0000 (14:54 -0400)]
Merge branch 'master' of ssh://git-annex.branchable.com
Joey Hess [Fri, 19 May 2023 18:53:18 +0000 (14:53 -0400)]
comment
yarikoptic [Fri, 19 May 2023 18:49:48 +0000 (18:49 +0000)]
Added a comment
yarikoptic [Fri, 19 May 2023 18:49:29 +0000 (18:49 +0000)]
Added a comment
yarikoptic [Fri, 19 May 2023 18:47:49 +0000 (18:47 +0000)]
Added a comment
Joey Hess [Fri, 19 May 2023 18:47:05 +0000 (14:47 -0400)]
assist: operate on all files in working tree by default
Consistency with sync and internal consistency is more important than
consistency with the assistant, which is not itself consistent about
what it does when run in a subdirectory.
Note that with -C, it will still commit staged changes to files outside
the directory. Like sync does. Presumably if the user is manually
staging things, then running this command, they intend to build up a
commit.
Sponsored-by: unqueued on Patreon
Joey Hess [Fri, 19 May 2023 18:34:02 +0000 (14:34 -0400)]
Fix bug in -z handling of trailing NUL in input
The obvious way to fix this would be to adapt lines to split on null.
However, it's actually nontrivial to rewrite lines. In particular it has a
weird implementation to avoid a space leak. See:
https://gitlab.haskell.org/ghc/ghc/-/issues/4334
Also, while that is a small amount of code, it's covered by a rather
complex copyright and I'd have to include that copyright in git-annex.
So, I opted to filter out the trailing empty string instead.
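The approach of filtering out the trailing empty string can be sketched as
follows. This is a Python illustration of the idea, not the Haskell code in
git-annex, and the function name split_nul is made up for the example.

```python
def split_nul(data):
    """Split NUL-delimited input into items.

    A trailing NUL terminates the last item rather than starting a
    new one, so the empty string it produces is filtered out.
    """
    items = data.split("\0")
    if items and items[-1] == "":
        items.pop()
    return items
```

Only the final empty string is removed, so an empty item in the middle of
the input is still preserved.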
Sponsored-by: Dartmouth College's Datalad project
Joey Hess [Fri, 19 May 2023 17:53:21 +0000 (13:53 -0400)]
comment
yarikoptic [Fri, 19 May 2023 14:36:46 +0000 (14:36 +0000)]
question about annotating availability in the snapshot
yarikoptic [Fri, 19 May 2023 13:53:46 +0000 (13:53 +0000)]
dropkey -z not working
nobodyinperson [Fri, 19 May 2023 05:54:44 +0000 (05:54 +0000)]
Added a comment: 👍 git annex assist
Joey Hess [Thu, 18 May 2023 19:02:10 +0000 (15:02 -0400)]
assist: fix bug committing just added file when -J is used
Need to wait for worker threads adding files before flushing the queue.
Joey Hess [Thu, 18 May 2023 18:56:13 +0000 (14:56 -0400)]
assist: fix bug committing just added file
Joey Hess [Thu, 18 May 2023 18:50:05 +0000 (14:50 -0400)]
fix assist to commit
Joey Hess [Thu, 18 May 2023 18:41:20 +0000 (14:41 -0400)]
improve --cleanup desc
Joey Hess [Thu, 18 May 2023 18:37:29 +0000 (14:37 -0400)]
git-annex assist
assist: New command, which is the same as git-annex sync but with
new files added and content transferred by default.
(Also this fixes another reversion in git-annex sync,
--commit --no-commit, and --message were not enabled, oops.)
See added comment for why git-annex assist does commit staged
changes elsewhere in the work tree, but only adds files under
the cwd.
Note that it does not support --no-commit, --no-push, --no-pull
like sync does. My thinking is, why should it? If you want that
level of control, use git commit, git annex push, git annex pull.
Sync only got those options because pull and push were not split
out.
Sponsored-by: k0ld on Patreon
Joey Hess [Thu, 18 May 2023 16:54:15 +0000 (12:54 -0400)]
add man pages for pull and push to cabal file
Joey Hess [Thu, 18 May 2023 15:19:59 +0000 (11:19 -0400)]
add test summary with number of parts and time
Sponsored-by: Brock Spratlen on Patreon
Joey Hess [Thu, 18 May 2023 14:55:11 +0000 (10:55 -0400)]
Merge branch 'master' of ssh://git-annex.branchable.com
Joey Hess [Thu, 18 May 2023 14:15:04 +0000 (10:15 -0400)]
fix inverted logic (fixes test fail)
Sponsored-by: Jack Hill on Patreon
Joey Hess [Thu, 18 May 2023 13:57:22 +0000 (09:57 -0400)]
update test suite to sync --no-content
A recently added warning and a plan to change behavior make it a good
idea to be explicit here.
nobodyinperson [Thu, 18 May 2023 08:40:37 +0000 (08:40 +0000)]
Added a comment
Joey Hess [Wed, 17 May 2023 17:41:33 +0000 (13:41 -0400)]
update
Joey Hess [Wed, 17 May 2023 17:33:47 +0000 (13:33 -0400)]
Merge branch 'master' of ssh://git-annex.branchable.com
Joey Hess [Wed, 17 May 2023 17:33:42 +0000 (13:33 -0400)]
comment
Joey Hess [Wed, 17 May 2023 17:32:47 +0000 (13:32 -0400)]
improve docs
The man pages for these were not really clear that they add new files.
Joey Hess [Wed, 17 May 2023 17:23:42 +0000 (13:23 -0400)]
sync: Started transition to --content being enabled by default
When used without --content or --no-content, warn about the upcoming
transition, and suggest using one of the options, or setting
annex.synccontent.
Sponsored-by: Brett Eisenberg on Patreon
Joey Hess [Wed, 17 May 2023 16:46:22 +0000 (12:46 -0400)]
push: Support --cleanup
This option is not specific to sync, so it seemed it should be in either
pull or push as well as sync. Since it does modify the remote, it seems
better to have it in push; the modification of the local repo pulls in
the direction of pull, but not hard enough.
Maybe it would be better to have it in both?
Sponsored-by: Luke Shumaker on Patreon
Joey Hess [Wed, 17 May 2023 16:38:22 +0000 (12:38 -0400)]
improve description of --debugfilter
Joey Hess [Wed, 17 May 2023 16:33:57 +0000 (12:33 -0400)]
sync: Added -g as a short option for --no-content
I anticipate that if sync is transitioned to syncing content by default,
people will want a short option. And in repositories where
annex.synccontent = true, they already would. And pull and push sync
content by default, so a short option is useful with them too.
Mnemonic: -g makes only git data be synced
Also, -a makes only annex data be synced.
Would have preferred -c, which would complement -C, but it
was already taken to set git configs.
Sponsored-by: Noam Kremen on Patreon
nobodyinperson [Wed, 17 May 2023 10:41:18 +0000 (10:41 +0000)]
Added a comment
Joey Hess [Tue, 16 May 2023 20:37:30 +0000 (16:37 -0400)]
git-annex pull and push
Split out two new commands, git-annex pull and git-annex push. Those plus a
git commit are equivalent to git-annex sync.
In a sense, git-annex sync conflates 3 things, and it would have been
better to have push and pull from the beginning and not sync. Although
note that git-annex sync --content is faster than a pull followed by a
push, because it only has to walk the tree once, look at preferred
content once, etc. So there is some value in git-annex sync in speed, as
well as user convenience.
And it would be hard to split out pull and push from sync, as far as the
implementation goes. The implementation inside sync was easy, just adjust
SyncOptions so it does the right thing.
Note that the new commands default to syncing content, unless
annex.synccontent is explicitly set to false. I'd like sync to also do
that, but that's a hard transition to make. As a start to that
transition, I added a note to git-annex-sync.mdwn that it may start to
do so in a future version of git-annex. But a real transition would
necessarily involve displaying warnings when sync is used without
--content, and time.
Sponsored-by: Kevin Mueller on Patreon
Joey Hess [Tue, 16 May 2023 20:33:02 +0000 (16:33 -0400)]
remove unused imports
Joey Hess [Tue, 16 May 2023 20:25:23 +0000 (16:25 -0400)]
sync --no-pull and --no-push affect download and upload of content
The man page is somewhat vague about this, but I do think it was a bug
that these options didn't already behave that way. The options are
documented to disable imports and exports, which are the same operations
just with a special remote that uses trees.
The real motivation for this is that I'm adding git-annex pull and
git-annex push, and I want these options to turn off the equivalent of
those commands. And git-annex pull will certainly download and push
upload.
Sponsored-by: Nicholas Golder-Manning on Patreon
Joey Hess [Tue, 16 May 2023 19:55:24 +0000 (15:55 -0400)]
pullOption should be pushOption in seekExportContent
sync: Fix bug that made --no-pull, rather than --no-push prevent exporting
trees to special remotes.
Sponsored-by: Joshua Antonishen on Patreon
Joey Hess [Tue, 16 May 2023 17:08:43 +0000 (13:08 -0400)]
comment
nobodyinperson [Mon, 15 May 2023 21:13:34 +0000 (21:13 +0000)]
Added a comment
Joey Hess [Mon, 15 May 2023 20:22:11 +0000 (16:22 -0400)]
remove spam
Joey Hess [Mon, 15 May 2023 20:19:53 +0000 (16:19 -0400)]
comment
Joey Hess [Mon, 15 May 2023 20:00:30 +0000 (16:00 -0400)]
document how to include= a path with a space in it
POSIX character classes being allowed in globs was a surprise, but just
happened to fall out of the implementation in a way that seems
to behave correctly.
mdwn2man has to be tweaked to render the example properly.
The line I modified is the one that strips ikiwiki wikilinks out of the
man page.
Sponsored-by: Graham Spencer on Patreon
Joey Hess [Mon, 15 May 2023 19:35:29 +0000 (15:35 -0400)]
add git config debugging
(and process cwd debugging)
Sponsored-by: Dartmouth College's Datalad project
nobodyinperson [Mon, 15 May 2023 07:31:41 +0000 (07:31 +0000)]
Added a comment: Git Alias for a 'full sync'
aurtzy [Sun, 14 May 2023 02:52:38 +0000 (02:52 +0000)]
Added a comment
yarikoptic [Fri, 12 May 2023 19:06:50 +0000 (19:06 +0000)]
original issue -- need more logging
yarikoptic [Fri, 12 May 2023 13:22:19 +0000 (13:22 +0000)]
Added a comment
yarikoptic [Fri, 12 May 2023 13:22:01 +0000 (13:22 +0000)]
Added a comment
Joey Hess [Thu, 11 May 2023 17:57:59 +0000 (13:57 -0400)]
finished this
Joey Hess [Thu, 11 May 2023 17:52:22 +0000 (13:52 -0400)]
clean up uninit output
Don't think including the location of .git/annex/objects in the json is
really useful.
Joey Hess [Thu, 11 May 2023 17:50:20 +0000 (13:50 -0400)]
uninit: remove unnecessary ExitSuccess
That was added in 2011 to prevent writing to the git-annex branch on
shutdown. But, the use of saveState causes pending git-annex branch
writes to be completed before the branch is deleted. So, an unusual exit
is not needed.
Joey Hess [Thu, 11 May 2023 17:36:59 +0000 (13:36 -0400)]
uninit: Support --json and --json-error-messages
Had to convert uninit to do everything that can error out inside a
CommandStart. This was harder than feels nice.
(Also, in passing, converted CommandCheck to use a data type, not a
weird number that it was not clear how it managed to be unique.)
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Thu, 11 May 2023 17:26:55 +0000 (13:26 -0400)]
fix typo
Joey Hess [Thu, 11 May 2023 17:25:46 +0000 (13:25 -0400)]
uninit: Avoid buffering the names of all annexed files in memory
Oops, using the same list twice does prevent streaming in constant memory.
Sponsored-by: unqueued on Patreon
Joey Hess [Thu, 11 May 2023 17:24:34 +0000 (13:24 -0400)]
remove unused import
Joey Hess [Thu, 11 May 2023 17:20:35 +0000 (13:20 -0400)]
uninit: remove undocumented support for specifying files to act on
I think this was just copied from another command without paying
attention to what it did, because there does not seem to be any valid
reason to want to only unannex some files when running uninit.
Joey Hess [Wed, 10 May 2023 18:19:32 +0000 (14:19 -0400)]
configremote: Support --json and --json-error-messages
Seems unlikely to be too useful, but who knows.
Moved the checkSafeConfig call to happen after an action is started, so
it will be captured by --json-error-messages
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Wed, 10 May 2023 18:09:27 +0000 (14:09 -0400)]
enableremote: Support --json and --json-error-messages
Seems unlikely to be too useful, but who knows. Was trivial anyway.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Wed, 10 May 2023 18:01:46 +0000 (14:01 -0400)]
initremote: Support --json and --json-error-messages
Including special --whatelse handling.
Otherwise, it seems unlikely to be too useful, but who knows.
Refactored code to call starting before displaying error messages.
This makes the error messages be captured by --json-error-messages
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Wed, 10 May 2023 17:30:43 +0000 (13:30 -0400)]
support aeson for Map
Make unused --json use it, which is better than the doubly nested lists
it was using.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Wed, 10 May 2023 16:41:43 +0000 (12:41 -0400)]
upgrade: Support --json and --json-error-messages and --json-progress
Seems unlikely to be very useful, but trivial.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Wed, 10 May 2023 16:32:00 +0000 (12:32 -0400)]
merge: Support --json and --json-error-messages and --json-progress
Seems unlikely to be very useful, but trivial.
And, this completes the story that git-annex sync does not need json,
since every sub-operation is available in a command that does support json.
(Well, except for committing, but that's not a git-annex command.)
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Added a comment
Joey Hess [Tue, 9 May 2023 20:59:44 +0000 (16:59 -0400)]
include url in json output
The input field is consistently the url of the feed, which makes sense
as that is the user input, but to differentiate multiple urls downloaded
from a feed when using --json-progress -J, need the url that is being
downloaded too.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Tue, 9 May 2023 20:43:16 +0000 (16:43 -0400)]
importfeed: Support --json and --json-error-messages and --json-progress
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Tue, 9 May 2023 20:22:09 +0000 (16:22 -0400)]
importfeed: Move error to where --json-error-messages can capture it
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Tue, 9 May 2023 19:49:05 +0000 (15:49 -0400)]
importfeed: Support -J (and work toward supporting --json)
Both -J and --json needed importfeed to be refactored to use commandAction.
That was difficult, because of the interrelated nature of downloading feeds
and then downloading files from feeds, both of which needed to use
commandAction. And then checking for problems in feeds has to come after
these actions, which may be run as background jobs.
As for --json support, it's most of the way there, but still has some
warts, so I didn't enable jsonOptions yet. The warts include:
- An initial empty json record is displayed by getCache.
- Input is not populated, should be feed url
- feedProblem at end will not be captured by --json-error-messages
(see FIXME)
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Mon, 8 May 2023 20:25:40 +0000 (16:25 -0400)]
renameremote: Support --json and --json-error-messages
Seems unlikely to be useful, but it works so
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Joey Hess [Mon, 8 May 2023 20:03:34 +0000 (16:03 -0400)]
factor out maybeAddJSONField
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project